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sj^^^' ABSTRACT 

Context. Most of what is known about clustered star formation to date comes from well studied star forming regions located relatively 
nearby, such as Rho-Ophiuchus, Serpens and Perseus. However, the recent discovery of infrared dark clouds may give new insights in 
our understanding of this dominant mode of star formation in the Galaxy. Though the exact role of infrared dark clouds in the formation 
process is still somewhat unclear, they seem to provide useful laboratories to study the very early stages of clustered star formation. 
Infrared dark clouds have been identified predominantly toward the bright inner parts of the galactic plane. The low background 
emission makes it more difficult to identify similar objects in mid-infrared absorption in the outer parts. This is unfortunate, because 
'^ ' ' the outer Galaxy represents the only nearby region where we can study effects of different (external) conditions on the star formation 

^N ' process. 

Cn , Aims. The aim of this paper is to identify extended red regions in the outer galactic plane based on reddening of stars in the near- 

infrared. We argue that these regions appear reddened mainly due to extinction caused by molecular clouds and young stellar objects. 
The work presented here is used as a basis for identifying star forming regions and in particular the very early stages. An accompanying 
Wh' paper describes the cross-identification of the identified regions with existing data, uncovering more on the nature of the reddening. 

f^ , Methods. We use the Mann- Whitney U-test, in combination with a friends-of-friends algorithm, to identify extended reddened regions 

< I , in the 2MASS all-sky JHK survey. We process the data on a regular grid using two different resolutions, 60" and 90". The two 

"*^ ' resolutions have been chosen because the stellar surface density varies between the crowded spiral arm regions and the sparsely 

^ ' populated galactic anti-center region. 

Results. We identify 1320 extended red regions at the higher resolution and 1589 at the lower resolution run. The linear extent of the 
identified regions ranges from a few arc-minutes to about a degree. 

Conclusions. The majority of extended red regions are associated with major molecular cloud complexes, supporting our hypothesis 
^ ' that the reddening is mostly due to foreground clouds and embedded objects. The reliability of the identified regions is >99.9%. 

lO I Because we choose to identify object with a high reliability we can not quantify the completeness of the list of regions. 
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^ 1. Introduction (-100-10^ M^: lEgan et al.lfl998l: ICarev et alJll998l l2000l) and 

(f~5 T^ , , , , ,. „ „ . . they are located at larger distances (>1 kpc) compared to well- 

^ Dark clouds represent the earliest stages of star formation in ^^^^.^^ j^^,^^^^ ^^^ f^^^. ^ .^^^^ ^ ^^^^^^^j ^lished cat- 

. the Galaxy. Nearby dark clouds have been studied in great de- alog of dark clo uds extracted from the GLIMPSE Survey by 
> , tail and provide a wealth of infor mation on t he low -mass iso- [peretto & Fulled (l2009l) shows that most IRDCs have column 
^ ; lated star formation process (e.g., I ShuetalJ ll987|). However, ^^^^.^.^^ ^^j^^ N(H2)-5x 10^^ cm^ with some extreme objects 
^ . "^oMfs do not form in isolation but m groups or clusters ^^^^^. n(H2)>1023 cm^. This new class of dark clouds are 
^ . dLada^Lada 2003), and this clustered mode of star formation ^^j^^^^^ ^^ represent the very early stages of clustered star for- 
IS not well understood. Some key questions of current star for- nati on, hence commonly re ferred to as cluster-forming clumps 
mation research are the origin or the stellar mass distribution, / iRathhnrnp et al IboOfih 
the relation between high-mass star formation and stellar clus- 
ters and the effect of environment on the star formation process. Whil^ most of the star-forming activity in the Galaxy takes 
See Zinnecker & Yorke ( 2007) for a recent review of this topic. Pl^^e in the inner spiral arms and the moleculai- ring, signifi- 
In the mid 1990s, infrared surveys with the Infrared Space ^ant activity also occurs beyond the solar circle (R, > 8.5 kpc). 
Observatory (ISO) and the Midcourse Space Experiment (MSX) Prominent examples include well studied nearby (<500pc) re- 
have revealed a c lass of dark clouds, r eferred to as Infrared Dai'k 81°"^ such as the Orion and Taurus-Perseus-Aunga complexes 
Clouds (IRDCs lPeraultetal.1 \m6; 'EgaiLetalJ [i998h . These ^^ well as more distant (>2 kpc) active star forming regions such 
clouds are observed in silhouette against the bright infrared ^^ W3 and NGC 7538. However, the outer Galaxy is generally 
background emission of the galactic plane. IRDCs are typically somewhat neglected in studies of star formation compared with 
cold (<25 K) and dense (-10^ cm-^) with masses ranging from *e inner Galaxy, which is unfortunate because it is a useful lab- 

oratory to study the effect of different external conditions on the 

Send ojfprint requests to: W.F.F. Frieswijk process that initiates stellar birth. In particular, the metallicity. 
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pressure, and radiation field in the outer Galaxy are corisider- 
ably different from the inner Galaxy (iBrand & Wouterlootll995t 
[Rudolph et al. 2006), which makes it by far the most nearby re- 
gion to study the effect of such variations on star formation. 

The aim of this work is to locate star-forming regions in 
the outer galactic plane. To find concentrations of molecular 
gas in the outer galactic plane, where the mid-infrared back- 
ground is insufficient to measure absorption, we use the extinc- 
tion to backgrou nd stars. The traditional method to do this is via 
star counts (e.g., ' Wolflll923l) . which is limited by extinction to 
nearby clouds, especially when using optical data. This problem 
is avoided by using near-infrared data, but pick-up of emission 
from embedded objects makes this method unreliable. 

An alternative to star counts is to use the near-infrared 
colour excess (NICE, Lada et alj 1 19941) which is widely used 
to map the extinction toward d ark clouds (e.g ., ICambresv et al.l 
[20021 lAlves et all Il998t iLada et al.1 [J^. However, this 
method relies on knowledge of the distance to the object, 
so that a correction for foreground stars can be made. The 
presence of foreground stars can reduce the measured color 
excess, and h ence may influen ce an identification based on this, 
significantly (iLombardill2005h . Furthermore, identifying objects 
in an automated, uniform way toward large regions in the sky is 
difficult using NICE because the extinction law is not uniform in 
every direction. Also, an average reference color of background 
stars is required, which is usually adopted from a nearby, well 
selected, extinction-free region. 

The identification of dark clouds by the extinction they 
impose upon background stars requires a method that: 

1) determines the extinction from the colors of stars, 

2) can be applied to small numbers of stars (for high res- 
olution), 

3) can be treated uniformly in all directions, 

4) puts a confidence level to measurements, 

5) is independent of foreground stars. 

Because of the above mentioned limitations when using a near- 
infrared color excess method, we have chosen to search for can- 
didate dark cloud regions in a statistical manner rather than 
based on the absolute value of the extinction. 

In two papers, we describe the identification process of 
candidate regions in the outer galactic plane with the goal to 
build up a sample of objects that can be used for future studies 
of (clustered) star formation throughout the Galaxy. 

In this first paper we describe the Mann-Whitney U-test. 
This distribution-free statistical test can be applied uniformly 
to large data-sets in order to identify parts of the data with 
deviating properties. The purpose of this paper is to identify 
extended red regions in the sky using data from the Two Micron 
All Sky Survey (2MASS). In an accompanying paper (Frieswijk 
et al. 2010, hereafter Paper II) we perform a cross-correlation 
between the red regions presented here and existing data (optical 
dark clouds, MSX, IRAS and CO where available). The purpose 
of the second paper is to provide an indication of the actual 
nature of the reddening toward the identified regions 

In Section|2]we summarize the 2MASS data and discuss why 
we use the {H - Ks) colors for the identification. In Section 3 
we explain the Mann- Whitney U-test and describe the procedure 
that we follow in order to determine the statistics. In Section 4 
we describe the friends-of-friends approach that is used to ex- 
tract extended red regions from the reliability images produced 
by the U-test. In Section 5 we present the results. A catalog of 



the extended red regions is available onlinqj and at the CDS. The 
reliability images of the entire outer galactic plane can be down- 
loaded in FITS format from the same website. The final section 
summarizes our main conclusions. 



2. Data: the Two Micron All Sky Survey 

The Two Micron All Sky Survey (2MASS) scanned the entire 
sky uniformly in three near-IR bands: J (1.25/vm), H (1.65yum) 
and Ks (2.17^). The 2MASS Point Source Catalog (2MASS 
PSC. ISkmtskie et al..2006) consists of accurate positions (astro- 
metric accuracy RMS <200 mas) and brightness information for 
over 400 million stars. The point source catalog is > 99 % com- 
plete up to 7<15.8, H<15.1 and K5<14.3 magnitudes. Thepho- 
tometrical signal-to-noise ratio is > 10 (i.e. o-jh,k, ^ 0.1 mag) 
for sources brighter than the above stated completeness limits. 
The so-called faint extension of the point source catalog includes 
sources that reach 0.5 to 1.0 magnitude beyond the above lim- 
its. The completeness, reliability and uniformity of the faint ex- 
tension sources are not as good as the high-reliability PSC and 
photometric errors increase up to crjf],K^ ~ 0.4 magnitude. We 
make use of all available sources in 2MASS, including the faint 
extension. This increases the total number of objects by almost a 
factor 2 compared to star count studies, where the faint extension 
cannot be used because the data are incomplete. 

The (H - K5) color is an appropriate choice to perform the 
identification of red regions because the intrinsic (H - Ks) color 
of stars spans only a narrow range. For spectral types AOV to 
M6III the range is abo ut 0.00 to 0. 30 mag ( Bessell & Brett 19881: 
IWainscoat et al.lfl992h . In the Solar neighbourhood the average 
(H - K5) color is about 0.1 3 mag with a standard deviation of 
0. 1 mag. This assumes no reddening due to extinction and takes 
into account the sensitivity limit of 2MASS (14.3 mag for K^- 
band). By including the faint extension, the average (H - Ks) 
color increases slightly (0.15 mag) due to faint, intrinsically red- 
der M-dwarfs. If a region appears significantly redder, this sug- 
gests that either foreground extinction is present or the intrinsic 
color distribution of stars in that direction is different. The latter 
can be important if for example a group of young stellar objects, 
which are intrinsically redder, is located in the observed field. 
Though such regions will be identified being intrinsically red 
rather than reddened, they still represent star forming regions. A 
discrimination between the two, which is not obvious from the 
work in this paper, can be done by looking at other available data 
and will be discussed in Paper II. 

We analyse the 2MASS data in the outer Galaxy from /=90° 
to 270° and ^=-3.5° to +3.5°. In order to process the data on 
a desktop computer, we divided the area up in regions of 2° x 
2°, with a sampling separation between different fields of one 
degree. 



3. Method 

3.1. The Mann-Whitney U-test 

The goal of this project is to identify, based on near-IR (H - Ks) 
colors, those regions which are redder compared to the back- 
ground. One potential approach is to calculate an average color 
of a few stars and test whether that average is different than 
a background sample of stars. This approach has some signif- 
icant drawbacks: how many sample stars should be used and 
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what is the stellar color of the background given that this value 
changes over the Galactic Plane? Is the average the correct statis- 
tic to calculate? Why not use the median or some other statistic? 
Furthermore, the distribution of the colors of stars is not a stan- 
dard parametric distribution. It is definitely not a Normal distri- 
bution which is needed if we want to use the average colors of 
stars and calculate a confidence based on a standard deviation. 
What is needed is a statistic, which does not depend on the distri- 
bution of stellar colors and which works when comparing small 
samples against a background of many stars in a robust manner 

The Mann- Whitney U-test (also referred to as the Wilcoxon 
rank-sum test, hereafter referred to as U-test) is a well- 
known non-para metric significance test first proposed by 
Wilcoxonl (Il945h and e xtended to arbitrary sample sizes by 
Mann & WhitnevI (Il947h . The statistic is based on the rank of 
the observation and not the observation value itself. This makes 
it a non-parametric statistic. The test is used to assess whether 
two samples of observations, say sample A and B, come from 
the same distribution. Samples A and B are required to be in- 
dependent and the observations should be ordinal or continuous 
measurements. The U-test is hypothesis testing, where the hy- 
pothesis is that the two samples come from the same distribution. 
This is called the Null hypothesis. The statistic which is calcu- 
lated is the sum of the joint ranking (U) of all the observations, 
in our case the (H - Ks) colors. 
An example 

The distribution of the sum of ranks of A and B, is readily calcu- 
lated for samples drawn from the same parent distribution. Take 
the case of 3 observations against 3 other observations. If the ob- 
servations are truly randomly drawn from the same parent dis- 
tribution, then one would expect a lower chance of pulling the 
three highest values for sample A (ranks 4,5,6) and B therefore 
resulting in the three lowest ranks 1,2,3. One would rather ex- 
pect a more random mix, e.g., 1,4,6 and 2,3,5. In the example 
above the sum of the ranks in the first case, f/A=15, f/B=6 is less 
likely than the second sum of ranks, {/a = 1 1, Ub-^Q- In this way 
the distribution of rank sums is built up. 

If the rank sum statistic is Ua-^^, the probability of this oc- 
curring randomly is small (5%), which means the Null hypoth- 
esis can be rejected on a statistical basis to a particular level of 
confidence (95%), in favor of the alternative or test hypothesis 
that the A samples are larger than the B samples. The alternative 
hypothesis must be created in such a way that it is the only al- 
ternative to the Null hypothesis. Rejecting the Null hypothesis 
means that the alternative hypothesis must be accepted at that 
level of confidence. 

The U-test can be applied to large scale surveys in order to iden- 
tify regions where source properties deviate significantly with 
respect to a reference field. We use the U-test statistic to test 
whether a set of colors of stars is different than the set of colors 
of stars from another sample. We test against the Null hypothesis 
that the set of colors is actually the same. When the test rejects 
the Null hypothesis (to a certain confidence level), we must ac- 
cept the test hypothesis that the colors are actually different. As 
can be seen from the example above, there is a fundamental dif- 
ference if Ua is greater or less than Ub- For near-IR colors this 
means we could test for stars redder or bluer than the background 
stars. 

For this work, we want to compare as few stars as possible 
against a background of tens of thousands. The benefit of us- 
ing a non-parametric test is that whatever size the two samples 
have, the distribution of the U-test statistic is known from pre- 
calculated tables (e.g.. Wall & Jenkins 2003), unlike for a para- 
metric test. However, tables for finding the probabilities of U 



usually do not contain values for samples with size in excess of 
20. For large sample sizes though, the distribution will approach 
a Normal distribution thanks to the Central Limit Theorem. The 
mean and standard deviation of the variables is simply given 
by the mean and standard deviation of the Normal distribution. 
iMosesI d 19641) stipulates that the smallest number of samples is 
5, to still be able to use the Normal approximation for the U-test. 
For a parametric test such as the average, a Normal approxima- 
tion is only valid for higher sample sizes. This is evaluated in 
Section 3.5. 



3.2. Applying the U-test procedure 

We perform our calculations on a regular grid. For each grid- 
cell the distribution of properties, i.e., star colors, is referred to 
as sample A. A reference field represents sample B and is de- 
termined locally. The choice of the reference field is explained 
in Section [331 The size of the grid-cells can be chosen at will, 
but a minimum number of sources (5; see Sec. 13.51 ) is required 
to retrieve a reliability based on the Normal approximation. The 
choice of the resolution is explained in more detail in Section 
EH 

The following steps are involved in the calculation of the 
probability P for any given grid-cell, where P is the probability 
that a cell contains a redder color distribution compared to the 
reference field. The whole procedure is schematically given also 
in Figure [H 

1) Let A be the distribution of colors in the cell (with m 
members) and B the local reference distribution (with 
n members, where n is very large in this study, i.e., 
-10'*"^). The Null hypothesis (Hq) is that A and B have 
the same parent population, or in perspective of our 
work, that the parent population of A is given by B. 
Note that when m is less than 5, the probability P is 
set to 50% and the next cell is processed. 

2) Rank the colors in ascending order for the combined 
members of A and B. Preserve the A or B identity for 
each member. 

3) Sum the number of A-rankings to get the statistical 
value Ua- 

Generally, numerically equivalent (also known as tied) obser- 
vations are assigned with the average of their respective ranks. 
However, we avoid tied observations by adding very small ran- 
dom numbers to the color values (<10 '*). This has no impact on 
the results, but makes the computations easier For large samples, 
the sampling distribution tends to a Normal distribution (see also 



Sec. 13.5b with a mean /i^ and variance cr^ given by 

Ha = m(N + l)/2 and crl = mn(N + 1)/12, 



(1) 



respectively, where N = m + n. The significance, z, can then be 
assessed from the Normal distribution by calculating 



t/A-jUA±0.5 

O-A 



(2) 



The term ±0.5, in statistics often referred to as the continuity 
correction factor, is a correction required to maintain continu- 
ity when a discrete distribution, such as the U-test statistic, is 
approximated by a continuous distribution, such as the Normal 
distribution. Because the Normal distribution consists of all real 
numbers, but the U-test values are discreet (integer values), a 
Normal approximation should identify the discreet event 'n' 
with the normal interval '(n-0.5, « -1-0.5)', where n is any integer 
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1) Define A and B 





B = {m, 
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Fig. 1 : Schematic representation of the U-test procudure. 



value. In this situation we use -0.5 (instead of +0.5) because we 
consider only objects in the upper tail, i.e., where the distribution 
is redder (instead of bluer) than the reference distribution. The 
probability P is calculated by evaluating the integral from -inf to 
z of the Gaussian probability density function <l>(z), where <l)(z) 
is given by 



<I>(z) 



3.3. Reference distribution 



1 



„-J-/2 



^/2^ 



(3) 



A single reference distribution is almost certainly not represen- 
tative for every field along the galactic plane. First, because the 
stellar population, and thus the color distribution, may vary with 
galactic coordinates. Second, because foreground extinction will 
be present everywhere. The reddening caused by the extinction 
depends on the amount of foreground material and on the dust 
properties toward different lines of sight. Although the overall 
extinction features are interesting, the target for this work is to 
find extreme color changes relative to the local environment. 



Therefore we make no attempt to extract large scale extinction 
features along the galactic plane. 

Instead of using a single reference distribution everywhere, 
we define the distribution of colors in every 2° X 2° field as a 
reference for the grid cells in the inner square degree of the re- 
spective field. This means that we select a local reference dis- 
tribution as best representation for the colors in each field, thus 
avoiding the effect of color variations on galactic scale. Note that 
the colors of stars in the cells that are being compared to the local 
reference are omitted from the reference distribution itself. 



3.4. Resolution iimitation 

The limiting factor for the spatial resolution that can be achieved 
in performing the U-test is determined by the minimum of 5 
sources required for the Normal approximation. As a require- 
ment for the grid-cell size (resolution) we state that in every 
field at least 75% of the cells should contain 5 or more sources. 
Because the stellar surface density decreases toward the galactic 
anti-centre (/=180°, b=0°), we process the data at two different 
resolutions. The highest possible resolution, mostly applicable 
towards the spiral arm regions at /~90°-140° and Z~240°-270°, 
is 60". For the region between /~140°-240° we require a grid 
with cells of 90". Note that we process the entire outer galactic 
plane at both resolutions and all data are made available. The 
average number of stars is about 8 per cell for the high resolu- 
tion grid toward the spiral arm regions as well as for the low- 
resolution grid toward the anti-center region. 

3.5. Minimum requirements: a Monte Cario approacli 

In the U-test we assume that even for small samples drawn from 
a large set of sources, the distribution of the significance values 
(z) tends to Normal. With a Monte Carlo simulation we show that 
the significance distribution of randomly drawn samples from a 
large distribution is indeed close to a Gaussian probability den- 
sity distribution. We specifically simulate samples of 8 sources 
because this is the average number of sources that are present in 
a grid cell. The large distribution is extracted from an observed 
field centered on 1=122°, b=l°. We have performed the same 
simulation using various other fields to test whether the results 
below change when less, or more reddening is present in a field. 
We find that this is not the case. 

The resulting distribution of z is given in the lower left panel 
in Figure |2] In the upper left panel we display the distribution of 
z for samples of 5 sources , the minimum sample size for a proper 
U-test result according to'M osesI (11964 '). The peak as well as the 
wings are slightly overestimated by the Gaussian profile in the 
simulation for samples of 5 sources, but increasing to samples of 
8 shows that the simulation converges fast to a Normal distribu- 
tion. The difference between the the areas under the curves for 
the wing region (z > 2) is less than 10% (<5% for samples of 8). 
By assuming that the profile is given by a Gaussian distribution, 
we underestimate the actual probability of having a red pixel in 
the regime considered here, i.e., z > 2.326 corresponding to a 
probability P >99% when evaluating the integral over Equation 
3. 

We compared the simulated results of the U-test with two 
parametric tests: the average and median. The right panels of 
Figure |2] show that the distribution of average values deviates 
much more from a Gaussian profile compared to the U-test. The 
distribution is clearly skewed and cannot be represented by a 
simple profile. The area in the wing (H - Ks> 0.5 mag) is un- 
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derestimated by a;50% (a; 100% for 8 samples). The shape of the 
distribution of average values approaches a Normal distribution 
only for sample sizes > 20, but the area in the wing remains 
underestimated. Almost identical results are found for the distri- 
bution of median colors, with a comparably large discrepancy in 
the wing area. 

The average and median colors are often used, e.g., for near- 
infrared color excess (NICE). Even though the median value is 
less sensitive to (extreme) outliers in the distribution than the 
average, the main issue is, that both are not well represented by 
simple profiles. This makes it difficult, if not impossible, to put a 
significance to the measured values. This justifies the use of the 
non-parametric U-test statistics for the automated identification 
process presented in this paper. 

3.6. Reliability and Completeness 

There are several considerations that need to be kept in mind 
when interpreting the resulting images created by the U-test 
method. These are related to the reliability and completeness 
of the cells that are identified as being red. We discuss them here. 

Type I error 

The first error is related to the reliability, or the significance 
level of the selected cells. If, as in our case, rejection is defined 
at 99% this suggests that about 1% of the cells will be selected 
on statistical arguments, rather than being red. However, these 
cells are expected to be randomly distributed over the fields. 
To exemplify that this is indeed the case, we consider the 
following values extracted from the low-resolution 2° x 2° 
field toward 1-166°, b=l°. This field is assumed to be empty, 
because no objects, identified as clusters of red cells, ended up 
in a final identification. However, the field contains 76 cells 
out of a total of 6561 (~ 1%) where the Null hypothesis of the 
U-test was rejected. Indeed, these cells are spread more or less 
randomly over the field, or else they would have been identified 
as clustered objects. 

Type II error 

This error is related to the cells that are not identified, i.e., 
where the Null hypothesis is not rejected but should have been. 
It is very difficult to put a realistic value to this error. In a broad 
sense it is inversely related to the Type I error. That is, the more 
reliable the sample is the less complete it is and vice versa. 
Furthermore, the Type II error usually occurs when sample 
sizes are too small. Because we have chosen to produce a high 
reliability catalog, we are giving in on completeness due to ffiis 
error 

Sample statistics 

According to Moses (1964), when one of the distributions that 
go into the U-test contains fewer than 5 sources the outcome 
of the U-test is unreliable. There is not sufficient information 
available on the color distribution. Cells containing fewer than 
5 sources are omitted in the identification process and, thus, 
introduce another form of incompleteness which depends on 
the resolution. Processing at higher resolution means being less 
complete. 

Considering the limitations and reliability of the method, 
we have decided to put the effi^rt in producing a reliable target 
list at relatively high resolution (60" and 90"), rather than being 
complete. Furthermore, our intention is not to evaluate the 
content or structure of the outer galactic plane nor finding all 



'dark clouds' present in the outer Galaxy. We purely present a 
statistical method that enables the identification of real objects. 

4. Extracting candidate sources 

Friends- of -friends approach 

To increase the reliability of red regions and to avoid the 
selection of random cells, we add an additional step to the 
process. By means of a friends-of-friends algorithm, we select 
only clusters of 4 or more red cells for the final list of candidate 
objects. The reason for using 4 cells here is based on the prob- 
ability of identifying clusters of randomly distributed red cells, 
which becomes negligible (<<1%) for groups larger than 3 
cells. These probabilities depend on the linking length described 
below and were evaluated in the Monte Carlo simulatio ns. The 
friend s-of-friends algorithm, first applied by Huchra & Gellerl 
(Il982h and often encountered in cosmological studies, is a 
well-known technique and can be simplified to find groups, or 
structures, in projection on the sky. We use the friends-of-friends 
approach to identify groups of cells on the sky where the color 
distribution of stars is red compared to the local surroundings, 
i.e., where the U-test returns a probability of P- 99% or higher. 

Linking length 

A crucial parameter required for the friends-of-friend method 
is the linking length, L, used to determine whether selected 
cells belong to the same group or not. If L is taken too large, 
all cells will be assigned to the same group. If L is too small 
all cells will be considered isolated. Because we do not know 
the cause of the red color distribution in different cells (either 
extinction due to a foreground material at unknown distance, or 
a concentration of intrinsically red objects at unknown distance) 
it is impossible without further information to put a physical 
size to L. We therefore choose a linking length in terms of the 
cell size, ^ceii- 

An upper limit for .Sceii can be estimated from an empty field. 
Consistent with the defined reliability, about 1 % of the cells in 
an empty field are rejected by the U-test and spread randomly 
over the field. We assess an average distance between rejected 
cells by using a Monte Carlo simulation, where A^ cells (=1% 
of fl^) are selected at random locations from a square box with 
side a. For a^ we use the number of cells that make up a 2° x 2° 
field at the given resolutions, i.e., a=81 cells at 90" and a-\2\ 
cells at 60" resolution. The average distance in the simulation 
converges to 5.3 and 5.2 times the cell-size, respectively. This is 
in agreement with the theoretically expected value, given by 



1 axa 
average distance - t; \l =5 [x Sceii] 



(4) 



(lHertzlll909tlClark & Evansl[T95l . 

A finking length much larger than this average distance will 
group cells together even if the distribution is random. This ob- 
viously contradicts the goal of finding clustered regions. A lower 
limit is obviously the cell size itself. We have tested the method 
by varying L between 2 and 5. The higher values (>3) identify 
mostly large scale structures (>0.5°). For a value of 2, most cells 
are isolated or grouped with fewer than 3 other cells. We decide 
to use a linking length half that of the average distance between 
random points, i.e., L=2.5 times the cell size, for both resolu- 
tions. The probability for finding 'accidental groups' of 4 cells 
using L-2.5 and considering that 1% of the cells are randomly 
selected by the U-test was evaluated in the Monte Carlo simula- 
tions described above and is <0.1%. 
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Fig. 2: Visualization of the resuh of a Monte Carlo simulation, where we randomly draw 6000 samples of 5 (upper histograms) and 
8 (lower histograms) 2MASS colors from a test field at /=122°, ^7=1°. The distributions of the significance z are displayed on the 
left. The right histogram displays the distribution of average colors in the same simulation. A blow-up of the high-end tail of each 
distribution is given in the narrow panels right of the histograms. The solid lines in each panel represent Gaussian profiles where the 
peak position and FWHM are adopted from the simulated distributions. 



A consequence of setting the linking length to 2.5 is that 
groups are identified with cells in between the red ones that may 
not be identified as red. This can either be due to a Type II er- 
ror (Sec. 3.6), or insufficient stars present in that cell (the sam- 
ple statistics. Sec. 3.6). In both cases this illustrates that not all 
reddened regions can be detected and thus implies that we are 
limited in the completeness of the catalog. 

5. Results 

Figure [3] displays an example of the output of the U-test in the 
form of a probability image for the region centered on /=! 12°, 
b=l°. For comparison we display the 8 ;um emission as observed 
by MSX. We identify 24 extended red regions in this field. The 
red cells of these regions are presented in different colors in Fig. 
[3] with the spatial extent of the objects indicated by the ellipses. 
Some objects clearly correlate with extended bright 8 yum emis- 
sion coming from well-known star forming regions such as NGC 
7538 or S 159. This field also contains object H474, which has 
been selected as a 'massive dark cloud' -candidate for follow- 
up observations. These observations have revealed the first in- 
frared dark cloud identified as such in the outer galactic plane 
(iFrieswiik et alJl2007il2008h . 

The resulting catalog of extended red regions and the 
probability images in FITS-format are available onlinqj. The 
low-resolution part consists of 1589 objects and the high 



resolution part of 1320. The catalog is available in electronic 
form at the CDS with sources designated FrSh LNNNN and 
FrSh HNNNN for low- and high resolution objects, respectively. 
Table 1 presents an excerpt from the high resolution catalog. 
The objects in the table correspond to those that are identified 
in the field displayed in Figure 3 and include NGC 7538 (FrSh 
H463) and S 159 (FrSh H470). The diff^erent columns present 
the following information: 

Column (1): Object identification 

Column (2)-(3): Galactic coordinates of the object centre 

Column (4): Total number of red pixels 

Column (5)-(6): Extent in / and b in pixel units (1 pix=r) 

Column (7): Number of 2MASS sources 

Note that the number of 2MASS sources in Column (7) 
contains information on the stellar surface density. In principle 
this can be used to distinguish between objects identified by 
extinction and objects representing a clustering of young stars, 
because extinction may lower the stellar density whereas a 
clustering may increase it. However, because we are using the 
2MASS catalog including the faint extension the variations in 
stellar surface density may also arise from the incompleteness 
of the catalog and we do not discuss it further in this paper. 



http;//www.astro.rug.nl/~ismgroup/OuterGalaxy/ 
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Fig. 3: The left panel displays the probability image returned by the U-test. The grey-scales represent the probability that the color 
distributions in the cells are redder with respect to the reference distribution. The friends-of-friends method identified the extended 
red regions, displayed in arbitrary colors. The ellipses give the spatial extent of the regions. Black cells are identified as being red, 
but are considered isolated. Objects 463 and 470 coiTespond to NGC 7538 and S 159, respectively. Object 474 is the first outer 
Galaxy 'cluster-forming clump' -candidate, identified as such by the work presented here. The right image shows the same region 
in MSX 8 fj.m emission for comparison. Bright emission from warm dust, heated by embedded, massive young stars, is seen toward 
both NGC 7538 and S 159. The upper part of the image is almost devoid of extended 8 yum emission, typical for the majority of 
directions in the outer Galaxy. 



5.1. Size- and spatial distribution 

The linear extent of the extended red regions ranges from a few 
arc-minutes to about a degree. Figure |4| gives an overview of 
the cell-number distribution for the regions in the high- and low 
resolution catalog. This shows that about half of the regions are 
identified as groups of 4 or 5 cells. Regions consisting of more 
than 10 cells account for 18% of the objects in the high resolu- 
tion catalog and for 28% of the objects in the low resolution one. 
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Fig. 4: Histogram displaying the distribution of the number of 
cells per extended red region for the high-resolution (solid) and 
low-resolution (dashed) catalog. 



Figure |5] shows the sky distribution of the extended red re- 
gions in the outer galactic plane. The solid histogram in the 
lower panel represents the number of high-resolution regions 
in bins of 5° galactic longitude. The dashed histogram dis- 
plays the distribution of low-resolution regions. The galactic 
latitude distribution is given in the horizontal histogram in the 
right panel. Several authors have found that the distribution 
of (molecular) clouds toward the inner galactic plane peaks at 
negative galactic latitu de (Peretto & Fuller 2009; Schulleret al] 
I2OO9I: iRosolowskv et al . 2009), suggested to result from the Sun 
being located slightly above the Galactic plane. We do not find a 
similar distribution toward the outer Galaxy. In fact, about 60% 
of the high resolution objects are at positive latitudes (a!50% 
for low resolution). However, this can be attributed to the local, 
well-populated. Vela and Cygnus/Cepheus star forming regions 
which are located mainly above the plane. 

The 2-dimensional spatial distribution is displayed in the up- 
per panel. The red and blue ellipses display the location and lin- 
ear extent of the high and low resolution regions, respectively. 
Some well-known areas associated with molecular clouds and 
star forming activity in the outer Galaxy are indicated. Most 
of the regions we find are well confined within the extent of 
these large complexes, and by far most regions are identified in 
the crowded spiral arm regions near Vela and Cygnus/Cepheus. 
Toward lower stellar densities (/ « 140°-240°) there are more re- 
gions identified with the low resolution than the high resolution 
grid because at high resolution there are often insufficient stars 
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Fig. 5: Spatial distribution of extended red regions identified in the outer galactic plane at 60" (solid histogram, red regions) and 90" 
(dashed histogram, blue regions) resolution. The regions are identified with a reliability of >99.9%. A few well-known star forming 
regions in the Galaxy are given for reference. 
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in a cell to use for the U-test. Many of the large-scale regions 
identified at low resolution, in particular visible toward the Vela 
and Cygnus/Cepheus regions, are identified as groups of smaller 
regions with the high resolution grid. 

5.2. Column density sensitivity 

The objects in the catalog are identified without using a param- 
eter that relates to a physical quantity. However, we do have in- 
formation on the average (H - K^) colors of cells in all objects 
and we use it to derive a rough estimate of the column density 
sensitivity. Note however, that we can not account for any cor- 
rections due to contamination of unreddened foreground stars 
and the values below represent only lower limits. 

The average color in the cells of identified regions is <{H - 
/rs)>~0.54 mag with a standard deviation of crxiO.lS mag. Some 
cells have an average color up to =s2 mag. These value can be 
converted to a visual extinction Ay using the following equa- 
tions; 



l5.9xE(H-Ks) 



(iRieke & Lebofskvl[T98l . 



E(H - Ks) - (H - /Ts-observed) - (H - /Ts-intrinsic ) ■ 



(5) 



(6) 



Table 1: Excerpt from the catalog of extended red objects. 



where E{H - K^) represents the color excess, (H - /Ts-observed) 
the observed color in a cell and (H - /Ts-inmnsic) the intrinsic 
color For the intrinsic color we use a value of 0.15 mag (Sec. 
|2]l. This means we are sensitive to visual extinction values rang- 
ing from a few to s;30 magnitudes, corresponding to molecular 
column densities up to N(H2)~3xlO^^ cm"^ (Bohlin e t al. 197 ^. 
This is comparable to values found for the majority of IRDCs 
(iPeretto & Fulleill2009h . 
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6. Conclusions 

We have used a statistical approach in combination with a 
friends-of-friends algorithm to measure deviations in the spatial 
{H - Ks) color distribution of stars in the outer galactic plane, 
with the goal to identify extended reddened regions. We pro- 
cessed the galactic plane at 60" and 90" resolution, resulting 
in the identification of 1320 high resolution and 1589 low res- 
olution extended red regions. The reliability of individual red 
cells that make up each object is 99% and by using a friend- 
of-friends approach, the resulting catalog of objects is 99.9% 
reliable. Because we have put the effort in producing a highly 
reliable source list we cannot quantify the completeness of the 
catalog. 

The majority of the objects are located toward well-known 
molecular cloud complexes and some correspond to specific, 
well-known objects such as NGC 7538 and S 159. This corre- 
lation strengthens the argument that the red nature of the regions 
is caused by extinction and/or embedded young stellar objects 
rather than intrinsic star color variations. The main goal of our 
analysis is to find previously undetected dark clouds in the outer 
Galaxy. However, many other types of objects are included in the 
catalog and the results may be compared to other outer Galaxy 
studies in the future. 

Based on the parameters derived from the statistical test it 
is impossible to determine the nature of the reddening. The next 
step is to cross-correlate the objects with existing optical, in- 
frared and CO data. This process is described in an accompany- 
ing paper (Frieswijk et al. 2010). 
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